Abstract

Cellwise outliers are likely to occur together with casewise outliers in modern 
datasets of relatively large dimension. Recent work has shown that traditional 
robust regression methods may fail when applied to such datasets. We propose a 
new robust regression procedure to deal with casewise and cellwise outliers. The 
proposed method, called three-step regression, proceeds as follows: first, it uses a 
consistent univariate filter, that is, a procedure that flags and eliminates extreme 
cellwise outliers; second, it applies a robust estimator of multivariate location and 
scatter to the filtered data to down-weight casewise outliers; third, it computes 
robust regression coefficients from the estimates obtained in the second step. The 
three-step estimator is consistent and asymptotically normal at the central model 
under some assumptions on the tails of the distributions of the continuous covariates. 
The estimator is extended to handle both continuous and dummy covariates using an 
iterative algorithm. Extensive simulation results show that the three-step estimator 
is resilient to cellwise outliers. It also performs well under casewise contamination 
when compared to traditional high breakdown point estimators. 


1 Introduction 

The vast majority of procedures for robust linear regression are based on the classical 
Tukey-Huber contamination model (THCM) in which a relatively small fraction of cases 
may be contaminated. High breakdown point affine equivariant estimators such as least 
trimmed squares (Rousseeuw, 1984), S-regression (Rousseeuw and Yohai, 1984) and 
MM-regression (Yohai, 1985) proceed by down-weighting outlying cases, which makes
sense and works well in practice under THCM. However, in some applications, the
contamination mechanism may be different in that random cells in a data table (with rows
as cases and columns as variables) are independently contaminated. In this paradigm, a
small fraction of random cellwise outliers could propagate to a relatively large fraction
of cases, breaking down classical high breakdown point affine equivariant estimators (see
Alqallaf et al., 2009). Since cellwise and casewise outliers may co-exist in some
applications, our goal in this paper is to develop a method for robust regression estimation
and inference that can deal with both cellwise and casewise outliers. 

There is a vast literature on robust regression for casewise outliers, but only a scant
literature for cellwise outliers and none for both types of outliers in the regression
context. Recently, Öllerer et al. (2015) combined the ideas of the coordinate descent
algorithm (called the shooting algorithm in Fu, 1998) and simple S-regression (Rousseeuw
and Yohai, 1984) to propose an estimator called the shooting S. The shooting S-estimator
assigns an individual weight to each cell in the data table to handle cellwise outliers in
the regression context. The shooting S-estimator is robust against cellwise outliers and
vertical response outliers.

In this paper, we propose a three-step regression estimator, namely the 3S-regression
estimator, which combines the ideas of filtering cellwise outliers and robust regression
via covariance matrix estimation (Maronna and Morgenthaler, 1986; Croux et al., 2003).
By filtering, here we mean detecting outliers and replacing them by missing values as
in Agostinelli et al. (2015). Our estimator proceeds as follows: first, it uses a univariate
filter to detect and eliminate extreme cellwise outliers in order to control the effect of
outlier propagation; second, it applies a robust estimator of multivariate location and
scatter to the filtered data to down-weight casewise outliers; third, it computes robust
regression coefficients from the estimates obtained in the second step. With the choice
of a filter that has simultaneously good sensitivity (is capable of filtering outliers) and
good specificity (can preserve all or most of the clean data), the resulting estimator can
be resilient to both cellwise and casewise outliers; furthermore, it attains consistency
and asymptotic normality for clean data. In this regard, we propose a filter that is
consistent under some assumptions on the tails of the covariate distributions. By
consistent filter, we mean a filter that asymptotically preserves all the data when they
are clean.

The rest of the paper is organized as follows. In Section 2, we introduce a family of 
consistent filters. In Section 3, we introduce 3S-regression. In Section 4, we show some 
asymptotic properties of 3S-regression. In Section 5, we evaluate the performance of 
3S-regression in an extensive simulation study. In Section 6, we analyze a real data set 
with cellwise and casewise outliers. In Section 7, we conclude with some remarks. We 
also provide a document referred to as “supplementary material”, which contains all the 
proofs, additional simulation results, and other related material. 

2 Consistent filter 

Filtering is a method for pre-processing data in order to control the effect of potential 
cellwise outliers. In this paper, we pre-process the data by flagging outliers and replacing 
them by missing values, NAs. This method of filtering has recently been used for robust 
estimation of multivariate location and scatter (Danilov, 2010; Agostinelli et al., 2015)
and for clustering (Farcomeni, 2014a,b). Also, Farcomeni (2015) proposed a procedure
to determine a data-driven choice for the number of filtered cells to increase the efficiency
of the estimator. 

Consistent filters are ones that do not filter good data points asymptotically. Gervini
and Yohai (2002) introduced a consistent filter for normal residuals in regression
estimation to achieve a fully efficient robust regression estimator. Consistent filters are
desirable because their good asymptotic properties are shared by the subsequent
estimation procedure. In this paper, we introduce a new family of consistent filters for
univariate data.

Consider a random variable $X$ with a continuous distribution function $G(x)$. We
define the scaled upper and lower tail distributions of $G(x)$ as follows:

\[
F^u(t) = P_G\!\left( \frac{X - \eta^u}{\operatorname{med}(X - \eta^u \mid X > \eta^u)} \le t \,\middle|\, X > \eta^u \right)
\quad\text{and}\quad
F^l(t) = P_G\!\left( \frac{\eta^l - X}{\operatorname{med}(\eta^l - X \mid X < \eta^l)} \le t \,\middle|\, X < \eta^l \right).
\tag{1}
\]

Here, med stands for median, $\eta^u = G^{-1}(1 - \alpha)$, $\eta^l = G^{-1}(\alpha)$, and $0 < \alpha < 0.5$. We
use $\alpha = 0.20$, but other choices could be considered. To simplify the notation, we set
$s^u = \operatorname{med}(X - \eta^u \mid X > \eta^u)$ and $s^l = \operatorname{med}(\eta^l - X \mid X < \eta^l)$. Alternatively, a combined
tails approach could be used for symmetric distributions as in Gervini and Yohai (2002).

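Let $\{X_1, \dots, X_n\}$ be a random sample from $G$, and let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be
the corresponding order statistics. Consistent estimators for $(\eta^u, \eta^l, s^u, s^l)$ are given by

\[
\hat\eta^u_n = G_n^{-1}(1 - \alpha), \qquad \hat s^u_n = \operatorname{med}(\{X_i - \hat\eta^u_n \mid X_i > \hat\eta^u_n\}),
\]
\[
\hat\eta^l_n = G_n^{-1}(\alpha), \qquad \hat s^l_n = \operatorname{med}(\{\hat\eta^l_n - X_i \mid X_i < \hat\eta^l_n\}),
\]

where $G_n^{-1}(a)$, $0 < a < 1$, is the empirical quantile and $\operatorname{med}(\{Y_1, \dots, Y_m\}) = Y_{([m/2])}$
is the sample median (see Lemma 1.1 in the supplementary material for a proof
of the consistency for $\hat s^u_n$ and $\hat s^l_n$). The empirical distribution functions for the scaled
upper and lower tails in (1) are now given by

\[
F^u_n(t) = \frac{\sum_{i=1}^n I\!\left(0 < (X_i - \hat\eta^u_n)/\hat s^u_n \le t\right)}{\sum_{i=1}^n I(X_i > \hat\eta^u_n)}
\quad\text{and}\quad
F^l_n(t) = \frac{\sum_{i=1}^n I\!\left(0 < (\hat\eta^l_n - X_i)/\hat s^l_n \le t\right)}{\sum_{i=1}^n I(X_i < \hat\eta^l_n)}.
\]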

Upper and lower tails outliers can be flagged by comparing the empirical distribution
functions for the scaled tails with their expected distributions. We assume that aside
from contamination, $F^u$ and $F^l$ decay exponentially fast or faster. Let $\{a\}^+ = \max(0, a)$
denote the positive part of $a$. Then, we define the proportions of flagged upper and lower
tails outliers by

\[
d^u_n = \sup_{t \ge t_0} \{F_0(t) - F^u_n(t)\}^+
\quad\text{and}\quad
d^l_n = \sup_{t \ge t_0} \{F_0(t) - F^l_n(t)\}^+,
\]

where $F_0(t) = 1 - \exp(-\log(2)\,t)$ and $t_0 = 1/\log(2)$. When $X - \eta^u \mid X > \eta^u$ is
exponentially distributed with a rate of $\lambda^u > 0$, the standardized tail $(X - \eta^u)/s^u \mid X > \eta^u$ has an
exponential distribution with a rate of $\log(2)$, leading to our choice of $F_0(t)$ and $t_0$.
Finally, we filter $d^u_n \times 100\%$ of the most extreme points in the upper tail $\{X_i \mid X_i > \hat\eta^u_n\}$, and
filter $d^l_n \times 100\%$ of the most extreme points in the lower tail $\{X_i \mid X_i < \hat\eta^l_n\}$. Equivalently,
setting

\[
t^u_n = \min\{t : F^u_n(t) \ge 1 - d^u_n\}
\quad\text{and}\quad
t^l_n = \min\{t : F^l_n(t) \ge 1 - d^l_n\},
\]

we filter $X_i$'s with $X_i < \hat\eta^l_n - \hat s^l_n t^l_n$ or $X_i > \hat\eta^u_n + \hat s^u_n t^u_n$.

We tried several heavy tail models for $F_0(t)$ including Pareto distributions with
different tail indexes, and we found that the chosen exponential model strikes a good
balance between the robustness and consistency of the filtering procedure.
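
The filtering rule of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the empirical quantile is taken to be the ceil(n a)-th order statistic, the default `t0 = 1/log(2)` is an assumed value, and the function names (`univariate_filter`, `flagged_proportion`, and so on) are hypothetical.

```python
import math

LOG2 = math.log(2.0)

def median(values):
    s = sorted(values)
    m = len(s)
    return s[m // 2] if m % 2 else 0.5 * (s[m // 2 - 1] + s[m // 2])

def empirical_quantile(sorted_x, a):
    # Assumed convention: the ceil(n * a)-th order statistic.
    k = min(len(sorted_x), max(1, math.ceil(len(sorted_x) * a)))
    return sorted_x[k - 1]

def flagged_proportion(scaled, t0):
    """sup over t >= t0 of max(0, F0(t) - Fn(t)), F0(t) = 1 - exp(-log(2) t)."""
    ts = sorted(scaled)
    m = len(ts)
    d = 0.0
    for i, t in enumerate(ts):
        if t > t0:
            # Just below the jump at t, Fn equals i/m while F0 is largest on
            # that step, so the supremum is reached as a left limit at t.
            d = max(d, (1.0 - math.exp(-LOG2 * t)) - i / m)
    return d

def tail_flags(deviations, t0):
    """deviations: dict index -> positive distance beyond the tail cutoff."""
    if not deviations:
        return set()
    scale = median(deviations.values())
    scaled = {i: dev / scale for i, dev in deviations.items()}
    d = flagged_proportion(scaled.values(), t0)
    n_flag = math.floor(len(scaled) * d + 1e-9)  # tolerance for rounding error
    return set(sorted(scaled, key=scaled.get, reverse=True)[:n_flag])

def univariate_filter(x, alpha=0.20, t0=1.0 / LOG2):
    """Return 0/1 flags; 0 marks a cell to be replaced by a missing value."""
    xs = sorted(x)
    eta_u = empirical_quantile(xs, 1.0 - alpha)
    eta_l = empirical_quantile(xs, alpha)
    upper = {i: v - eta_u for i, v in enumerate(x) if v > eta_u}
    lower = {i: eta_l - v for i, v in enumerate(x) if v < eta_l}
    flagged = tail_flags(upper, t0) | tail_flags(lower, t0)
    return [0 if i in flagged else 1 for i in range(len(x))]

# Evenly spaced values (light tails) plus one gross outlier.
x = list(range(1, 101)) + [1000]
flags = univariate_filter(x)
```

In this toy sample only the value 1000 is flagged: the evenly spaced points have tails lighter than the exponential reference, so their contribution to the supremum is zero.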

Theorem 2.1 (proved in the supplementary material) below shows that our filter is
consistent under the following assumption on the tails of $G(x)$.

Assumption 2.1. $G(x)$ is continuous, and $F^u(t)$ and $F^l(t)$ satisfy the following:

\[
F_0(t) - F^u(t) \le 0, \; t \ge t_0
\quad\text{and}\quad
F_0(t) - F^l(t) \le 0, \; t \ge t_0.
\]

Theorem 2.1. Suppose that Assumption 2.1 holds for $G(x)$. Then, $d^u_n \to 0$ a.s. and
$d^l_n \to 0$ a.s.

In practice, the distributions $F^u(t)$ and $F^l(t)$ are unknown. To allow for some
flexibility, Assumption 2.1 does not completely specify $F^u(t)$ and $F^l(t)$, but it only
requires that their upper tails are as heavy as or lighter than the upper tail of $F_0(t)$.

3 Three-step regression 

3.1 The estimator 

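Consider the model

\[
Y_i = \alpha + X_i^t \beta + \varepsilon_i \tag{2}
\]

for $i = 1, \dots, n$, where the error terms $\varepsilon_i$ are i.i.d. and independent of the covariates
$X_i = (X_{i1}, \dots, X_{ip})^t$. The least squares (LS) estimates $(\hat\alpha_{LS}, \hat\beta_{LS})$ are defined as the
minimizers of the sum of squared residuals:

\[
(\hat\alpha_{LS}, \hat\beta_{LS}) = \arg\min_{(\alpha, \beta^t) \in \mathbb{R}^{p+1}} \sum_{i=1}^n (Y_i - \alpha - X_i^t \beta)^2.
\]

The solution to this problem is explicit:

\[
\hat\beta_{LS} = \hat\Sigma_{xx}^{-1} \hat\Sigma_{xy}, \qquad
\hat\alpha_{LS} = \hat\mu_y - \hat\mu_x^t \hat\beta_{LS}. \tag{3}
\]

Here, $\hat\Sigma_{xx}$, $\hat\Sigma_{xy}$, $\hat\mu_y$, and $\hat\mu_x$ are the components of the empirical covariance matrix and
mean:

\[
\hat\Sigma = \begin{pmatrix} \hat\Sigma_{xx} & \hat\Sigma_{xy} \\ \hat\Sigma_{yx} & \hat\Sigma_{yy} \end{pmatrix}
\quad\text{and}\quad
\hat\mu = \begin{pmatrix} \hat\mu_x \\ \hat\mu_y \end{pmatrix} \tag{4}
\]

for the joint data $\{Z_1, \dots, Z_n\}$ with $Z_i = (X_i^t, Y_i)^t$.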

Several authors (see Maronna and Morgenthaler, 1986; Croux et al., 2003) proposed
to achieve robust regression and inference for casewise outliers by robustifying the
components in (3). Croux et al. (2003) replaced the empirical covariance matrix and mean by
the multivariate S-estimator (Davies, 1987). We will refer to this approach as two-step
regression (2S-regression). Croux et al. (2003) have shown that under mild assumptions
(including symmetry of $\varepsilon_i$ and independence of $\varepsilon_i$ and $X_i$) 2S-regression is Fisher
consistent and asymptotically normal even if the S-estimators of multivariate location
and scatter themselves are not consistent. Furthermore, 2S-regression is resilient to all
kinds of outliers, that is, vertical outliers, bad leverage points, and good leverage points.
Note that down-weighting good leverage points could lead to some efficiency loss, but it
may also prevent the underestimation of the variance of the estimator, which could be
problematic for inferential purposes (see for example, Ruppert and Simpson, 1990).
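
The plug-in idea behind (3) can be illustrated with a short Python sketch: any location vector and scatter matrix for the joint data (covariates first, response last) are turned into regression coefficients. Here the classical mean and covariance are used, which reproduces LS; supplying a robust location and scatter instead would give the two-step approach. The function names are hypothetical and the small linear solver is included only to keep the example self-contained.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def regression_from_moments(location, scatter):
    """Plug-in coefficients; the response is the last coordinate."""
    p = len(location) - 1
    s_xx = [row[:p] for row in scatter[:p]]
    s_xy = [row[p] for row in scatter[:p]]
    beta = solve(s_xx, s_xy)                      # beta = S_xx^{-1} S_xy
    alpha = location[p] - sum(m * b for m, b in zip(location[:p], beta))
    return alpha, beta

def mean_and_cov(rows):
    n, q = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(q)]
    cov = [[sum((r[j] - mu[j]) * (r[k] - mu[k]) for r in rows) / n
            for k in range(q)] for j in range(q)]
    return mu, cov

# Noise-free toy data generated from y = 1 + 2*x1 - 3*x2.
Z = [(x1, x2, 1 + 2 * x1 - 3 * x2)
     for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]]
alpha, beta = regression_from_moments(*mean_and_cov(Z))
```

Because the toy data are noise-free, the recovered coefficients match the generating values up to rounding error.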

To deal with casewise and cellwise outliers, we propose to use a generalized
S-estimator that uses the consistent filter described in Section 2. The estimator is similar
to that in Agostinelli et al. (2015), but with a filter that is consistent for a broader
range of distributions. This generality is needed in the regression setting. Our proposed
globally robust regression estimator, called 3S-regression, is given by:

\[
\hat\beta_{3S} = S_{xx}^{-1} S_{xy}, \qquad
\hat\alpha_{3S} = m_y - m_x^t \hat\beta_{3S}. \tag{5}
\]


Here, (m, S) is a generalized S-estimator computed as follows: 

Step 1. Filter extreme cellwise outliers to prevent cellwise contaminated cases from
having large robust Mahalanobis distances in Step 2, and

Step 2. Down-weight casewise outliers by applying the generalized S-estimator (GSE) for
multivariate location and scatter (Danilov et al., 2012) to the filtered data from
Step 1. The GSE is a generalization of the S-estimator for incomplete data that
are missing completely at random (MCAR). Since the independent contamination
model (ICM) assumes that cells are outlying completely at random, the
MCAR assumption is fulfilled if the ICM holds.


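More precisely, consider a set of covariates $\{X_1, \dots, X_n\}$. We perform univariate
filtering as described in Section 2 on each variable, $\{X_{1j}, \dots, X_{nj}\}$, $j = 1, \dots, p$. Let
$\{U_1, \dots, U_n\}$ be the resulting auxiliary vectors of zeros and ones, with zeros indicating
the filtered entries in $X_i$. That is, $U_i = (U_{i1}, \dots, U_{ip})^t$, where

\[
U_{ij} = I\!\left( \hat\eta^l_{nj} - \hat s^l_{nj} t^l_{nj} \le X_{ij} \le \hat\eta^u_{nj} + \hat s^u_{nj} t^u_{nj} \right)
\]

and the quantities indexed by $j$ are those of Section 2 computed on the $j$-th variable.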


The goal of the filter is to prevent propagation of cellwise outliers. If the fraction
of cases with at least one flagged cell is very small (below 1%, say) then propagation of
cellwise outliers is not an issue and the filter can be safely turned off. The procedure that
turns the filter off when the fraction of affected cases is below a given small threshold is
considerably simpler to analyze from the asymptotic point of view. Moreover, it retains
all the robustness properties derived from the filter. Let $n_0 = \#\{1 \le i \le n : U_i = \mathbf{1}\}$
be the number of complete observations after filtering. We set

\[
U_i^* = \mathbf{1}\, I\!\left( \frac{n - n_0}{n} \le \xi \right) + U_i\, I\!\left( \frac{n - n_0}{n} > \xi \right), \tag{6}
\]

with $\xi$ equal to some small threshold. In this paper we use $\xi = 0.01$.
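
The switch-off rule can be sketched as follows. This is a minimal illustration, assuming the flags are stored as a list of 0/1 rows and that the filter is turned off when the fraction of affected cases is at most the threshold; the name `apply_switch` is hypothetical.

```python
def apply_switch(U, xi=0.01):
    """Return the flags U* used downstream.

    U is a list of rows of 0/1 flags (0 = filtered cell). If the fraction of
    rows with at least one filtered cell is at most xi, the filter is turned
    off and every cell is kept; otherwise U is returned unchanged.
    """
    n = len(U)
    n0 = sum(1 for row in U if all(row))      # complete rows after filtering
    if (n - n0) / n <= xi:
        return [[1] * len(row) for row in U]
    return [list(row) for row in U]

# 200 cases, a single flagged cell: 0.5% of cases affected, filter turned off.
U_small = [[1, 1, 1] for _ in range(200)]
U_small[7][1] = 0
# 10 cases, a single flagged cell: 10% of cases affected, filter kept.
U_large = [[1, 1, 1] for _ in range(10)]
U_large[3][0] = 0

off = apply_switch(U_small)
kept = apply_switch(U_large)
```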

Finally, let $\mathbb{Z} = (Z_1, \dots, Z_n)^t$ and $\mathbb{U} = ((U_1^*, \dots, U_n^*)^t, \mathbf{1})$. The generalized
S-estimator can now be defined as

\[
m = \hat m_{GS}(\mathbb{Z}, \mathbb{U}), \qquad
S = \hat S_{GS}(\mathbb{Z}, \mathbb{U}), \tag{7}
\]

where $\hat m_{GS}$ and $\hat S_{GS}$ are the robust multivariate location and scatter generalized S-estimators
for incomplete data, $(\mathbb{Z}, \mathbb{U})$, with Tukey's bisquare rho function $\rho_B(t) = \min(1, 1 - (1 - t)^3)$
and 50% breakdown point (see Danilov et al., 2012, for the full definition). Note that
when $\mathbb{U} = (\mathbf{1}, \dots, \mathbf{1})$ (i.e., when the input data is complete), the generalized S-estimator
reduces to the S-estimator (Danilov et al., 2012).
3.2 Models with continuous and dummy covariates

For models with continuous and dummy covariates, the direct computation of 3S- 
regression is likely to fail because the sub-sampling algorithm (needed to compute the 
generalized S-estimator) is likely to yield collinear subsamples. In this case, we endow 
3S-regression with an iterative algorithm similar to that in Maronna and Yohai (2000) 
to deal with continuous and dummy covariates. 

Consider now the following model:

\[
Y_i = \alpha + X_i^t \beta_x + D_i^t \beta_d + \varepsilon_i \tag{8}
\]

for $i = 1, \dots, n$, where $X_i = (X_{i1}, \dots, X_{ip_x})^t$ is a $p_x$ dimensional vector of continuous
covariates and $D_i = (D_{i1}, \dots, D_{ip_d})^t$ is a $p_d$ dimensional vector of dummy covariates.

Set $\mathbf{X} = (X_1, \dots, X_n)^t$, $\mathbf{D} = (D_1, \dots, D_n)^t$, and $\mathbf{Y} = (Y_1, \dots, Y_n)^t$. We assume that the
columns in $\mathbf{X}$ and $\mathbf{D}$ are linearly independent.

We modify the alternating M- and S-regression approach proposed by Maronna
and Yohai (2000). Our algorithm uses 3S-regression to estimate the coefficients of the
continuous covariates and regression M-estimators with Huber's rho function
$\rho_H(t) = \min(1, t^2/2)$ (Huber and Ronchetti, 2009) to estimate the coefficients of the dummy
covariates. More specifically, the algorithm works as follows:

\[
(\hat\alpha^{(k)}, \hat\beta_x^{(k)}) = g(\mathbf{X}, \mathbf{Y} - \mathbf{D}\hat\beta_d^{(k-1)}), \qquad
\hat\beta_d^{(k)} = M(\mathbf{D}, \mathbf{Y} - \hat\alpha^{(k)} - \hat{\mathbf{X}}\hat\beta_x^{(k)}), \quad \text{for } k = 1, \dots, K, \tag{9}
\]

where $g(\mathbf{X}, \mathbf{Y})$ denotes the operation of 3S-regression for a response vector $\mathbf{Y}$ and covariates $\mathbf{X}$ as
defined in (5) and $M(\mathbf{D}, \mathbf{Y})$ denotes the operation of the regression M-estimator with no
intercept for $(\mathbf{Y}, \mathbf{D})$. We let $\hat{\mathbf{X}}$ be $\mathbf{X}$ with the filtered entries imputed by
the best linear predictor using $m^{(k)}$ and $S^{(k)}$, the generalized S-estimates at the $k$-th
iteration as defined in (7). We use $\hat{\mathbf{X}}$ instead of $\mathbf{X}$ to control the effect of propagation of
cellwise outliers.

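As in Maronna and Yohai (2000), to calculate the initial estimates, $(\hat\alpha^{(0)}, \hat\beta_x^{(0)}, \hat\beta_d^{(0)})$,
we first remove the effect of $D_i$ from the continuous covariates and the response variable.

Let

\[
\tilde{\mathbf{Y}} = \mathbf{Y} - \mathbf{D}t \quad\text{and}\quad \tilde{\mathbf{X}} = \mathbf{X} - \mathbf{D}T,
\]

where $t = M(\mathbf{D}, \mathbf{Y})$ and $T$ is a $p_d \times p_x$ matrix with the $j$-th column $T_j = M(\mathbf{D}, (X_{1j}, \dots, X_{nj})^t)$.
Now, the initial estimates are defined by

\[
(\hat\alpha^{(0)}, \hat\beta_x^{(0)}) = g(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}), \qquad
\hat\beta_d^{(0)} = M(\mathbf{D}, \mathbf{Y} - \hat\alpha^{(0)} - \hat{\mathbf{X}}\hat\beta_x^{(0)}),
\]

with $\hat{\mathbf{X}}$ the imputed version of $\mathbf{X}$.

Finally, the procedure in (9) is iterated until convergence or until it reaches a
maximum of $K = 20$ iterations. We choose $K = 20$ because our simulation has shown that
the procedure usually converges for $K < 20$, provided good initial estimates are used.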
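
The alternating structure of (9) can be sketched generically in Python. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: it handles one continuous and one dummy covariate, it omits the filtering and imputation steps, and it plugs in least-squares stand-ins (`simple_ls`, `ls_no_intercept`, both hypothetical names) where the paper uses 3S-regression and a Huber M-estimator.

```python
def simple_ls(x, r):
    """Stand-in for g: intercept and slope of a simple least-squares fit."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - mr) for a, b in zip(x, r)) / sxx
    return mr - slope * mx, slope

def ls_no_intercept(d, r):
    """Stand-in for M: least-squares slope through the origin."""
    return sum(a * b for a, b in zip(d, r)) / sum(a * a for a in d)

def alternate(x, d, y, fit_g, fit_m, beta_d=0.0, max_iter=20, tol=1e-8):
    """Alternate between the continuous and the dummy coefficients."""
    alpha = beta_x = 0.0
    k = 0
    for k in range(1, max_iter + 1):
        # Step 1: fit the continuous part on the response minus dummy effects.
        alpha, beta_x = fit_g(x, [yi - beta_d * di for yi, di in zip(y, d)])
        # Step 2: fit the dummy part on what the continuous part leaves over.
        new_beta_d = fit_m(d, [yi - alpha - beta_x * xi for yi, xi in zip(y, x)])
        converged = abs(new_beta_d - beta_d) < tol
        beta_d = new_beta_d
        if converged:
            break
    return alpha, beta_x, beta_d, k

# Noise-free toy data from y = 1 + 2*x + 3*d with a balanced dummy.
x = [0, 1, 2, 3, 0, 1, 2, 3]
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + 2 * xi + 3 * di for xi, di in zip(x, d)]
alpha, beta_x, beta_d, n_iter = alternate(x, d, y, simple_ls, ls_no_intercept)
```

Starting from the deliberately poor value `beta_d = 0`, the error in this toy example halves at each iteration, so the iteration cap is reached before the tolerance is met. This is one way to see why good initial estimates matter for the number of iterations.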

4 Asymptotic properties of three-step regression 

Theorem 4.1 (proved in the supplementary material) establishes the equivalence between
3S-regression and 2S-regression (Croux et al., 2003) for the case of continuous covariates.
Let $(\hat\alpha_{3S}, \hat\beta_{3S})$ be the 3S-regression estimate and $(\hat\alpha_{2S}, \hat\beta_{2S})$ be the 2S-regression estimate
based on the sample $\{Z_1, \dots, Z_n\}$, where $Z_i = (X_i^t, Y_i)^t$. Let $G(x)$ and $G_j(x)$ be the
distribution functions for $X_i$ and $X_{ij}$, respectively.

Theorem 4.1. Suppose that Assumption 2.1 holds for each $G_j$, $j = 1, \dots, p$. Then,
with probability one, for sufficiently large $n$, $\hat\alpha_{3S} = \hat\alpha_{2S}$ and $\hat\beta_{3S} = \hat\beta_{2S}$.

Since 3S-regression becomes 2S-regression for sufficiently large n, 3S-regression
inherits the established asymptotic properties of 2S-regression. Corollary 4.2 states the
strong consistency and asymptotic normality of 3S-regression. The corollary requires
the following regularity assumptions that are needed for deriving the consistency and
asymptotic normality of 2S-regression (see Croux et al., 2003).

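Assumption 4.1. Let $F_\varepsilon$ be the distribution of the error term $\varepsilon_i$ in (2). The distribution
$F_\varepsilon$ has a positive, symmetric and unimodal density $f_\varepsilon$.

Assumption 4.2. For all $v \in \mathbb{R}^p$ and $\delta \in \mathbb{R}$, $P_G(X^t v = \delta) < 1/2$.

Corollary 4.2. Suppose that Assumption 2.1 holds for each $G_j$, $j = 1, \dots, p$, and
Assumptions 4.1-4.2 hold. Denote $\hat\theta_{3S} = (\hat\alpha_{3S}, \hat\beta_{3S}^t)^t$ and $\theta = (\alpha, \beta^t)^t$. Then,

(a) $\hat\theta_{3S} \to \theta$ a.s.

(b) Let $H$ be the distribution of $(X^t, Y)$ and let $(m_H, S_H)$ be the S-estimator functional
(see Lopuhaä, 1989). We use the same partition outlined in (4) for $(m_H, S_H)$. Set
$\tilde{x} = (1, X^t)^t$. Then,

\[
\sqrt{n}\,(\hat\theta_{3S} - \theta) \to_d N(0, ASV(H)),
\]

where

\[
ASV(H) = C(H)^{-1} D(H)\, C(H)^{-1},
\]

and where

\[
C(H) = E_H\!\left[ w(d_H(Z))\, \tilde{x}\tilde{x}^t \right] + \frac{2}{\sigma_\varepsilon(H)^2}\, E_H\!\left[ w'(d_H(Z)) (Y - \tilde{x}^t\theta)^2\, \tilde{x}\tilde{x}^t \right],
\]
\[
D(H) = E_H\!\left\{ w^2(d_H(Z)) (Y - \tilde{x}^t\theta)^2\, \tilde{x}\tilde{x}^t \right\},
\]
\[
\sigma_\varepsilon(H) = \sqrt{S_{H,yy} - \beta^t S_{H,xx}\, \beta},
\]
\[
d_H(Z) = (Z - m_H)^t S_H^{-1} (Z - m_H),
\]
\[
w(t) = \rho_B'(t).
\]

Here, $\rho_B(t)$ is Tukey's bisquare rho function.

Remark 4.1. Croux et al. (2003) proved the Fisher consistency of 2S-regression, but
the strong consistency also follows from that and Theorem 3.2 in Lopuhaä (1989).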

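The asymptotic covariance matrix needed for inference can be estimated in the
following natural way. Let $(m, S)$ be the generalized S-estimate and $(\hat\alpha_{3S}, \hat\beta_{3S})$ be the
3S-regression estimate, with $\hat\theta_{3S} = (\hat\alpha_{3S}, \hat\beta_{3S}^t)^t$. Then, replace $Z_i = (X_i^t, Y_i)^t$ by
$\hat{Z}_i = (\hat{X}_i^t, Y_i)^t$ and $\tilde{x}_i = (1, X_i^t)^t$ by $\hat{x}_i = (1, \hat{X}_i^t)^t$, where $\hat{X}_i$ is the best linear
prediction of $X_i$ (which is possibly incomplete due to the filter) using $(m, S)$. The identified
cellwise outliers in $X_i$ are filtered and imputed in order to avoid the effect of propagation
of outliers on the asymptotic covariance matrix estimation. Now,

\[
\widehat{ASV}(H) = \hat{C}(H)^{-1} \hat{D}(H)\, \hat{C}(H)^{-1},
\]

where

\[
\hat{C}(H) = \frac{1}{n} \sum_{i=1}^n \left[ w(d_n(\hat{Z}_i)) + \frac{2}{\hat\sigma_{\varepsilon,n}^2}\, w'(d_n(\hat{Z}_i))\, r_i^2 \right] \hat{x}_i \hat{x}_i^t,
\]
\[
\hat{D}(H) = \frac{1}{n} \sum_{i=1}^n w^2(d_n(\hat{Z}_i))\, r_i^2\, \hat{x}_i \hat{x}_i^t,
\]
\[
\hat\sigma_{\varepsilon,n} = \sqrt{S_{yy} - \hat\beta_{3S}^t S_{xx}\, \hat\beta_{3S}},
\]
\[
d_n(\hat{Z}_i) = (\hat{Z}_i - m)^t S^{-1} (\hat{Z}_i - m),
\]
\[
r_i = Y_i - \hat{x}_i^t \hat\theta_{3S}.
\]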

Although the asymptotic covariance matrix formula is valid under clean data, we shall
show in Section 5 that our proposed inference remains approximately valid in the
presence of a moderate fraction of cellwise and casewise outliers.

In the case of continuous and dummy covariates, Maronna and Yohai (2000)
derived asymptotic results for the alternating regression M- and S-estimates. However,
there is no proof of asymptotic results when regression S-estimators are replaced by
2S-regression. The study of the asymptotic properties of the alternating M- and
2S-regression is worthy of future research.

5 Simulation 

We carried out extensive simulation studies in R (R Core Team, 2015) to investigate
the performance of 3S-regression by comparing it with least squares (LS) and two robust
alternatives:

(i) 2S-regression as in Croux et al. (2003). The location and scatter S-estimator
with bisquare $\rho$ function and 50% breakdown point is computed by an iterative
algorithm that uses an initial MVE estimator. The MVE estimator is computed
by sub-sampling with a concentration step. This procedure is implemented in the
R package rrcov, function CovSest, option method="bisquare" (Todorov and
Filzmoser, 2009); and

(ii) Shooting S-estimator introduced in Öllerer et al. (2015) with bisquare $\rho$ function
and 20% breakdown point (for each simple regression) as suggested by the authors
to attain a good trade-off between robustness and efficiency. The R code is available
at http://feb.kuleuven.be/Viktoria.Oellerer/software.

The generalized S-estimates needed by 3S-regression are computed using the R package
GSE, function GSE with default options (Leung et al., 2015). The regression M-estimates
needed by the alternating M- and 3S-regression are computed using the R package MASS,
function rlm, option method="M" (Venables and Ripley, 2002).



5.1 Models with continuous covariates 

We consider the regression model in (2) with $p = 15$ and $n = 150, 300, 500, 1000$. The
random covariates $X_i$, $i = 1, \dots, n$, are generated from the multivariate normal distribution
$N_p(\mu, \Sigma)$. We set $\mu = 0$ and $\Sigma_{jj} = 1$ for $j = 1, \dots, p$ without loss of generality because
GSE in the second step of 3S-regression is location and scale equivariant. To address
the fact that 3S-regression and the shooting S-estimator are not affine-equivariant, we
consider the random correlation structure for $\Sigma$ as described in Agostinelli et al. (2015).
We fix the condition number of the random correlation matrix at 100 to mimic the
practical situation for data sets of similar dimensions. Furthermore, to address the
fact that the two estimators are not regression equivariant, we randomly generate $\beta$ as
$\beta = Rb$, where $b$ has a uniform distribution on the unit spherical surface and $R$ is set
to 10. We set $\alpha = 0$ because GSE is location equivariant. The response variable $Y_i$ is
given by $Y_i = X_i^t\beta + \varepsilon_i$, where the $\varepsilon_i$ are independent (also independent of the $X_i$'s) identically
normally distributed with mean 0 and $\sigma = 0.5$. Finally, we consider the following
scenarios:

• Clean data: No further changes are done to the data;

• Cellwise contamination: Randomly replace a fraction $\epsilon$ of the cells in the covariates
by outliers $X_{ij}^{cont} = E(X_{ij}) + k \times SD(X_{ij})$ and an $\epsilon$ proportion of the responses by
outliers $Y_i^{cont} = E(Y_i) + k \times SD(\varepsilon_i)$, where $k = 1, 2, \dots, 10$;

• Casewise contamination: Randomly replace a fraction $\epsilon$ of the cases by leverage
outliers $(X_i^{cont}, Y_i^{cont})$, where $X_i^{cont} = cv$ and $Y_i^{cont} = (X_i^{cont})^t\beta + \varepsilon_i^{cont}$ with
$\varepsilon_i^{cont} \sim N(k, \sigma^2)$, where $k = 1, 2, \dots, 15$. Here, $v$ is the eigenvector corresponding to the
smallest eigenvalue of $\Sigma$ with length such that $(v - \mu)^t \Sigma^{-1} (v - \mu) = 1$. Monte
Carlo experiments show that the placement of outliers in this direction, $v$, is the
least favorable for our estimator. We repeat the simulation study in Agostinelli
et al. (2015) for dimension 16 and observe that $c = 8$ is the least favorable value
for the performance of the scatter estimator.

We consider $\epsilon = 0.01, 0.05$ for cellwise contamination, and $\epsilon = 0.10$ for casewise
contamination. The number of replicates for each setting is $N = 1000$.
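
The cellwise contamination scenario for the covariates can be sketched in Python as follows. This is a simplified illustration and not the simulation code of the study: the covariates are drawn as independent standard normals (the study uses a random correlation matrix), exactly `round(eps * n * p)` cells are replaced (the text does not say whether the count is fixed or random), and the function name `contaminate_cells` is hypothetical.

```python
import random

def contaminate_cells(X, eps, k, means, sds, rng):
    """Replace a fraction eps of the cells by E(X_ij) + k * SD(X_ij).

    Returns the contaminated copy and the set of affected row indices.
    """
    n, p = len(X), len(X[0])
    n_bad = round(eps * n * p)
    bad_cells = rng.sample(range(n * p), n_bad)   # cells chosen at random
    Xc = [list(row) for row in X]
    for cell in bad_cells:
        i, j = divmod(cell, p)
        Xc[i][j] = means[j] + k * sds[j]
    return Xc, {cell // p for cell in bad_cells}

rng = random.Random(1)
n, p = 300, 15
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Xc, affected_rows = contaminate_cells(
    X, eps=0.05, k=5, means=[0.0] * p, sds=[1.0] * p, rng=rng
)
changed = sum(Xc[i][j] != X[i][j] for i in range(n) for j in range(p))
```

Comparing `len(affected_rows)` with `n` gives a direct view of how a small fraction of contaminated cells spreads over a much larger fraction of cases.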


5.1.1 Coefficient estimation performance 

We examine the effect of cellwise and casewise outliers on the bias of the estimated 
coefficients. We evaluate the bias using the Monte Carlo mean squared error (MSE): 


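\[
MSE = \frac{1}{N} \sum_{m=1}^{N} \frac{1}{p} \sum_{j=1}^{p} \left( \hat\beta_j^{(m)} - \beta_j^{(m)} \right)^2,
\]

where $\hat\beta_j^{(m)}$ is the estimate for $\beta_j^{(m)}$ at the $m$-th simulation run.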
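
As a small illustration of this criterion, the following Python sketch averages the squared coefficient errors first over the coefficients of each run and then over runs. The function name and the toy numbers are hypothetical.

```python
def monte_carlo_mse(estimates, truths):
    """Mean over runs of the mean squared coefficient error within each run."""
    total = 0.0
    for est, true in zip(estimates, truths):
        total += sum((e - t) ** 2 for e, t in zip(est, true)) / len(true)
    return total / len(estimates)

# Two toy runs with two coefficients each.
mse = monte_carlo_mse(
    estimates=[[1.0, 2.0], [0.0, 0.0]],
    truths=[[1.0, 1.0], [0.0, 2.0]],
)
```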

Table 1 shows the MSE for clean data and the maximum MSE for all the cellwise 
and casewise contamination settings for n = 150,300. Figure 1 shows the curves of 
MSE for various cellwise and casewise contamination values for n = 300. The results 
for n = 150 are similar and the corresponding figure is shown as supplementary material. 

In the cellwise contamination setting, 3S-regression is highly robust against moderate
and large cellwise outliers ($k > 3$), but less robust against inliers ($k < 2$). Notice that
inliers also affect the performance of the shooting S-estimator but to a lesser extent.
Table 1: Maximum MSE in all the considered scenarios for models with continuous
covariates.

             Clean            1% Cellwise      5% Cellwise      Casewise
n =          150     300      150     300      150     300      150     300
3S           0.012   0.005    0.039   0.020    0.902   0.797    0.223   0.143
ShootS       0.034   0.017    0.134   0.080    1.129   0.912    1.570   1.460
2S           0.010   0.004    0.025   0.014    3.364   3.041    0.109   0.122
LS           0.009   0.004    2.723   2.440    4.812   4.732    8.286   8.182

[Figure 1 plot. Panels: 1% Cellwise, 5% Cellwise, Casewise. Horizontal axis: k. Curves: 3S, ShootS, 2S, LS.]


Figure 1: MSE for various cellwise and casewise contamination values, k, for models 
with continuous covariates. The sample size is n = 300. 


Since the filter does not flag inliers, 3S-regression and 2S-regression perform similarly
in the presence of inliers (see the central panel of Figure 1). The shooting S-estimator
is highly robust against large outliers, but less so against moderate cellwise outliers. As
expected, 2S-regression breaks down in the case of $\epsilon = 0.05$, when the propagation of
large cellwise outliers is expected to affect more than 50% of the cases.
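
The 50% figure can be checked with a short calculation. Assuming that cells are contaminated independently with probability $\epsilon$, a case with $p$ cells has at least one contaminated cell with probability $1 - (1 - \epsilon)^p$. The sketch below evaluates this for the covariates only ($p = 15$); the function name is hypothetical.

```python
def prob_case_contaminated(eps, n_cells):
    """Probability that a case has at least one contaminated cell,
    assuming independent contamination of each cell with probability eps."""
    return 1.0 - (1.0 - eps) ** n_cells

p5 = prob_case_contaminated(0.05, 15)   # about 0.537
p1 = prob_case_contaminated(0.01, 15)   # about 0.140
```

With 5% contaminated cells more than half of the cases are affected, which exceeds the breakdown point of casewise robust estimators, whereas with 1% the affected fraction stays near 14%.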

In the casewise contamination setting, 2S-regression has the best performance, as
expected. 3S-regression also performs fairly well in this setting. The shooting S-estimator
performs less satisfactorily in this case.

We have also considered other simulation settings and observed similar results (not
shown here). In particular, we considered $p = 5$ with $n = 50, 100$ and $p = 25$ with
$n = 250, 500$ under the same set of scenarios (clean data, cellwise contamination, and
casewise contamination). Moreover, we studied the performance of 3S-regression for
larger casewise contamination levels up to 20%. 3S-regression maintains its competitive
performance, outperforming the shooting S-estimator and not falling too far behind
2S-regression, which is expected to win in these situations.


5.1.2 Performance of confidence intervals 


We then assess the performance of confidence intervals for the regression coefficients
based on the asymptotic covariance matrix as described in Section 4. Intervals that
have a coverage close to the nominal value, while being relatively short, are desirable.
The $100(1 - \tau)\%$ confidence interval (CI) of 3S-regression has the form:

\[
CI(\hat\beta_j) = \left[ \hat\beta_j - \Phi^{-1}(1 - \tau/2) \sqrt{\widehat{ASV}(\hat\beta_j)/n},\;\; \hat\beta_j + \Phi^{-1}(1 - \tau/2) \sqrt{\widehat{ASV}(\hat\beta_j)/n} \right],
\]
Figure 2: CR for clean data and for cellwise and casewise contaminated data of various
sample size, n.


Table 2: Average lengths of confidence intervals for clean data and for cellwise and 
casewise contamination. 


             Clean            1% Cell., k = 5  5% Cell., k = 5  10% Case., k = 3
Size (n)     3S      2S       3S      2S       3S      2S       3S      2S
150          0.341   0.352    0.355   0.402    0.450   1.519    0.329   0.355
300          0.242   0.247    0.244   0.275    0.294   1.148    0.239   0.253
500          0.187   0.189    0.190   0.212    0.222   0.912    0.189   0.197
1000         0.133   0.133    0.134   0.150    0.155   0.662    0.137   0.140


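for $j = 0, 1, \dots, p$, where $\hat\beta_0 = \hat\alpha$. We consider $\tau = 0.05$ here. We evaluate the
performance of CI using the Monte Carlo mean coverage rate (CR):

\[
CR = \frac{1}{N} \sum_{m=1}^{N} \frac{1}{p} \sum_{j=1}^{p} I\!\left( \beta_j^{(m)} \in CI(\hat\beta_j^{(m)}) \right),
\]

and the Monte Carlo mean CI lengths:

\[
CIL = \frac{1}{N} \sum_{m=1}^{N} \frac{1}{p} \sum_{j=1}^{p} 2\, \Phi^{-1}(1 - \tau/2) \sqrt{\widehat{ASV}(\hat\beta_j^{(m)})/n}.
\]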

Figure 2 shows the CR in the case of clean data, 5% cellwise contamination ($k = 5$),
and 10% casewise contamination ($k = 3$) simulation, with different sample sizes
$n = 150, 300, 500, 1000$. The nominal value of 95% is indicated by the horizontal line in the
figure.
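
The interval and the coverage rate can be illustrated with the following Python sketch. It assumes that the estimated asymptotic variance of a coefficient is already available as a number; the function names and toy values are hypothetical.

```python
from statistics import NormalDist

def confidence_interval(beta_hat, asv, n, tau=0.05):
    """Normal-theory interval beta_hat -/+ z * sqrt(asv / n)."""
    z = NormalDist().inv_cdf(1.0 - tau / 2.0)
    half_width = z * (asv / n) ** 0.5
    return beta_hat - half_width, beta_hat + half_width

def coverage_rate(intervals, truths):
    """Share of (run, coefficient) pairs whose interval covers the truth."""
    hits = [
        lo <= t <= hi
        for run, true in zip(intervals, truths)
        for (lo, hi), t in zip(run, true)
    ]
    return sum(hits) / len(hits)

lo, hi = confidence_interval(beta_hat=0.0, asv=1.0, n=100)
cr = coverage_rate(
    intervals=[[(-1.0, 1.0), (0.0, 1.0)]],
    truths=[[0.5, 2.0]],
)
```

Since every run has the same number of coefficients, pooling all pairs gives the same value as averaging within runs first.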

For clean data, the coverage rates of all the intervals reach the nominal level when 
the sample size grows, as expected. For data with casewise outliers, 2S-regression yields 
the best coverage rate, which is closest to the nominal level. However, 3S-regression 
has an acceptable performance, comparable with that of 2S-regression. For data with 
cellwise outliers, 3S-regression yields intervals with a coverage rate relatively closer to 
the nominal value than LS and 2S-regression. 

Furthermore, the length of the intervals obtained from 3S-regression is comparable
to that of LS for clean data and that of 2S-regression for clean data and data with
casewise outliers. For data with cellwise outliers, 3S-regression yields intervals with lengths
relatively closer to those for clean data. Table 2 shows the average lengths of the
confidence intervals obtained from 3S- and 2S-regression in the case of clean data, 1%
cellwise contamination ($k = 5$), 5% cellwise contamination ($k = 5$), and 10% casewise
contamination ($k = 3$) simulation, with different sample sizes $n = 150, 300, 500, 1000$.
The results of LS are not included here.

In general, 3S-regression yields slightly shorter intervals than 2S-regression in all
scenarios because the asymptotic variance is calculated on the data with the filtered
cells imputed instead of the complete data. On the other hand, 2S-regression tends to
yield longer intervals in the cellwise contamination model, even when the propagation
of outliers is below the 0.5 breakdown point under THCM, for example, when $\epsilon = 0.01$.
This may be because 2S-regression loses a significant amount of clean data for estimation
when it down-weights cases with outlying components.

5.2 Models with continuous and dummy covariates 

We now conduct a simulation study to assess the performance of our procedure when
the model includes continuous and dummy covariates. We consider the regression model
in (8) with $p_x = 12$, $p_d = 3$, and $n = 150, 300$. The random covariates $(X_i, D_i)$,
$i = 1, \dots, n$, are first generated from the multivariate normal distribution $N_p(0, \Sigma)$, where
$\Sigma$ is the randomly generated correlation matrix with a fixed condition number of 100.
Then, we dichotomize Dij at $“^( 7 rj) where ttj = ^, 5,5 for j = 1,2,3, respectively. 
Finally, the rest of data are generated in the same way as described in Section 5.1. 

In the simulation study, we consider the following scenarios:

• Clean data: No further changes are done to the data;

• Cellwise contamination: Randomly replace an $\epsilon$ fraction of the cells in $\mathbf{X}$ by outliers
$X_{ij}^{cont} = E(X_{ij}) + k \times SD(X_{ij})$ and an $\epsilon$ proportion of the responses by outliers
$Y_i^{cont} = E(Y_i) + k \times SD(\varepsilon_i)$, where $k = 1, 2, \dots, 10$;

• Casewise contamination: Let $\Sigma_x$ be the sub-matrix of $\Sigma$ with rows and columns
corresponding to the continuous covariates. Randomly replace an $\epsilon$ fraction of the
cases in $\mathbf{X}$ by leverage outliers $X_i^{cont} = cv$, where $v$ is the eigenvector corresponding
to the smallest eigenvalue of $\Sigma_x$ with length such that $(v - \mu_x)^t \Sigma_x^{-1} (v - \mu_x) = 1$.
In this case, the number of continuous variables is 13 (instead of 16) and the
corresponding least favorable casewise contamination size is found to be $c = 7$
(instead of 8) using the same procedure as in Section 5.1. Finally, we replace the
corresponding response value by $Y_i^{cont} = (X_i^{cont})^t\beta_x + D_i^t\beta_d + \varepsilon_i^{cont}$ with
$\varepsilon_i^{cont} \sim N(k, \sigma^2)$, where $k = 1, 2, \dots, 10$.

Again, we consider $\epsilon = 0.01, 0.05$ for cellwise contamination, and $\epsilon = 0.10$ for casewise
contamination. The number of replicates for each setting is $N = 1000$.

Table 3 shows the MSE for clean data and the maximum MSE for all the cellwise
and casewise contamination settings for n = 150,300. Figure 3 shows the curves of
MSE for various cellwise and casewise contamination values for n = 300. The results
for n = 150 are similar and the corresponding figure is shown as supplementary
material. Overall, 3S-regression remains competitive in the case of continuous and dummy
covariates.

We also consider the case of non-normal covariates. The covariates are generated
from several asymmetric distributions, and the data are contaminated in a similar
fashion. The performance of 3S-regression in the case of non-normal covariates is similar to
Table 3: Maximum MSE in all the considered scenarios for models with continuous and
dummy covariates.

              Clean          1% Cellwise      5% Cellwise      Casewise
n =         150    300      150    300      150    300      150    300
3S        0.010  0.004    0.018  0.008    0.636  0.507    0.090  0.071
ShootS    0.012  0.005    0.026  0.015    0.746  0.468    0.450  0.387
2S        0.008  0.003    0.014  0.007    1.894  1.341    0.060  0.054
LS        0.007  0.003    2.785  2.532    5.162  4.981    1.332  1.322

Figure 3: MSE for various cellwise and casewise contamination values, k, for models
with continuous and dummy covariates. The sample size is n = 300. Curves are shown
for the estimators 3S, ShootS, 2S and LS.


the performance in the case of normal covariates. Results are available as supplementary 
material. 

6 Analysis of the Boston housing data 

We illustrate the effect of cellwise outlier propagation on classical robust estimators
using the Boston Housing data. The data, available at the UCI repository (Bache and
Lichman, 2013), were collected from 506 census tracts in the Boston Standard Metropolitan
Statistical Area in the 1970s on 14 different features. We consider the nine
quantitative variables that were extensively studied (e.g., see Öllerer et al., 2015). The
variables are listed and described in Table 2 in the supplementary material. There are no
missing data. The original objective of the study in Harrison and Rubinfeld (1978) was
to analyze the association between the median housing values (medv) in Boston and the
residents' willingness to pay for clean air.

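We fit the following model using 3S-regression, the shooting S-estimator, 2S-regression
and the LS estimator:

$$
\begin{aligned}
\log(\text{medv}) ={}& \alpha + \beta_1 \log(\text{crim}) + \beta_2\, \text{nox}^2 + \beta_3\, \text{rm}^2 + \beta_4\, \text{age} \\
&+ \beta_5 \log(\text{dis}) + \beta_6\, \text{tax} + \beta_7\, \text{ptratio} + \beta_8\, \text{black} + \beta_9 \log(\text{lstat}) + \varepsilon.
\end{aligned}
$$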
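
The variable transformations in the model above can be written out as a small helper. This is a sketch only: the function name `boston_design_row`, the dictionary layout of a record, and the numeric values in the usage example are hypothetical and are not taken from the dataset.

```python
import math


def boston_design_row(rec):
    """Map one raw record (dict of raw variables) to (response, regressors).

    The regressors follow the order of the fitted model:
    log(crim), nox^2, rm^2, age, log(dis), tax, ptratio, black, log(lstat).
    """
    y = math.log(rec["medv"])
    x = [
        math.log(rec["crim"]),
        rec["nox"] ** 2,
        rec["rm"] ** 2,
        rec["age"],
        math.log(rec["dis"]),
        rec["tax"],
        rec["ptratio"],
        rec["black"],
        math.log(rec["lstat"]),
    ]
    return y, x


# Usage with made-up values (illustrative only, not an actual census tract).
record = {"medv": 20.0, "crim": 0.1, "nox": 0.5, "rm": 6.0, "age": 0.6,
          "dis": 4.0, "tax": 0.3, "ptratio": 15.0, "black": 0.4, "lstat": 10.0}
y, x = boston_design_row(record)
```

Any of the four estimators can then be applied to the transformed response and regressors; the helper itself performs no estimation.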

The regression coefficient estimates and their P-values are given in Table 4. In particular, 
we observe that the regression coefficients for the covariates age and black are very 
different under 3S and 2S-regression. Moreover, age is significant under 2S-regression 
but highly non-significant under 3S-regression. 2S-regression is somewhat inefficient 
Table 4: Estimates and p-values of the regression coefficients for the original Boston
Housing data.

                   3S               ShootS             2S                LS
Variable     Coeff.  P-Val.    Coeff.  P-Val.    Coeff.  P-Val.    Coeff.  P-Val.
log(lstat)   -0.243  <0.001    -0.266            -0.153  <0.001    -0.395  <0.001
rm^2          0.015  <0.001     0.013             0.018  <0.001     0.007  <0.001
tax          -0.051  <0.001    -0.021            -0.046  <0.001    -0.028   0.006
log(dis)     -0.125  <0.001    -0.157            -0.126  <0.001    -0.139  <0.001
ptratio      -0.026  <0.001    -0.027            -0.025  <0.001    -0.029  <0.001
nox^2        -0.578   0.013    -0.463            -0.445   0.023    -0.451  <0.001
age          -0.023   0.645    -0.040            -0.152   0.001     0.050   0.391
black        -0.726   0.398     0.787            -0.007   0.993     0.500  <0.001
log(crim)    -0.006   0.513     0.004             0.005   0.527    -0.002   0.813


Table 5: Pairwise squared norm distances between the estimates for the original Boston
housing data.

             3S    ShootS       2S       LS
3S            -     1.389    3.145    6.725
ShootS                  -    4.312    4.661
2S                               -   16.614
LS                                        -


because it throws away a substantial amount of clean data due to the propagation of
cellwise outliers. It fully down-weights 16.4% of the cases in the dataset (cases that
receive a zero weight by the multivariate S-estimator). Slightly more than half of these
cases (8.7%) are affected by the propagation of cellwise outliers, mainly in the covariates
nox^2 and black (1.3% of the cells in the dataset are flagged by the consistent filter). After
filtering, these cases have relatively small partial Mahalanobis distances, indicating they
are close to the bulk of the data for the remaining variables.

We further compare the four estimators by computing their pairwise squared norm distances,
which for two estimates $\hat\beta^{(a)}$ and $\hat\beta^{(b)}$ are given by
$n \times \sum_{j=1}^{p} (\hat\beta^{(a)}_j - \hat\beta^{(b)}_j)^2 \, \mathrm{MAD}(\{X_{1j}, \dots, X_{nj}\})^2$
(see Öllerer et al., 2015), where MAD
is the median absolute deviation. Table 5 shows the squared norm distances for the
considered estimators. Overall, the three robust estimators are very different from LS.
As expected, 3S-regression and shooting S are closer to each other than they are to
2S-regression. Additional analysis provided as supplementary material indicates that
the observed differences between the three robust estimators are indeed mostly caused
by the propagation of cellwise outliers in the Boston housing data.
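
A MAD-scaled squared distance between two coefficient vectors can be computed as in the sketch below. It rests on assumptions that the text does not settle: the distance is taken to be $n$ times the sum over covariates of the squared coefficient difference multiplied by the squared MAD of that covariate, the MAD is used without a consistency factor, and the intercept is excluded. The function names are hypothetical.

```python
import statistics


def mad(values):
    """Median absolute deviation about the median (no consistency factor)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def squared_norm_distance(beta_a, beta_b, X):
    """n * sum_j (beta_a[j] - beta_b[j])^2 * MAD(column j of X)^2.

    beta_a and beta_b hold slope coefficients only; X is a list of rows.
    """
    n = len(X)
    total = 0.0
    for j, (a, b) in enumerate(zip(beta_a, beta_b)):
        column = [row[j] for row in X]
        total += (a - b) ** 2 * mad(column) ** 2
    return n * total


# Usage on a toy design matrix: the column MADs are 1 and 2.
X_toy = [[0, 0], [1, 2], [2, 4], [3, 6], [4, 8]]
d = squared_norm_distance([1.0, 1.0], [0.0, 0.5], X_toy)
```

Scaling each squared difference by the spread of its covariate makes coefficients attached to variables on very different scales comparable within one distance.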

7 Concluding remarks 

High breakdown point affine equivariant robust estimators are neither efficient nor
robust in the independent cellwise contamination model (ICM). By efficiency here we mean
the ability to use the clean part of the data. In fact, classical robust estimators are
inefficient under ICM because they may down-weight an entire row with a single component
being contaminated. Therefore, they may lose some useful information contained in the
data. Furthermore, the classical high breakdown point affine equivariant robust
estimators may break down under ICM. A small fraction of cellwise outliers could propagate,
affecting a large proportion of cases. For instance, the probability $\bar\epsilon$ that at least one
component of a case is contaminated is $\bar\epsilon = 1 - (1 - \epsilon)^p$, where $\epsilon$ is the proportion of
independent cellwise outliers. This implies that even if $\epsilon$ is small, $\bar\epsilon$ could be large for
large $p$, and could exceed the 0.5 breakdown point under THCM. For example, if $\epsilon = 0.1$
and $p = 10$, then $\bar\epsilon = 0.65$; and if $\epsilon = 0.05$ and $p = 20$, then $\bar\epsilon = 0.64$.
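
The propagation probability quoted above is easy to check numerically. The short sketch below reproduces the two examples and also finds the smallest dimension at which more than half of the cases are expected to be affected; the function names are illustrative and not part of any package.

```python
def prob_case_contaminated(eps, p):
    """Probability that at least one of p independent cells of a case is contaminated."""
    return 1.0 - (1.0 - eps) ** p


def smallest_dimension_exceeding(eps, threshold=0.5):
    """Smallest p for which the casewise contamination probability exceeds threshold."""
    p = 1
    while prob_case_contaminated(eps, p) <= threshold:
        p += 1
    return p


print(round(prob_case_contaminated(0.10, 10), 2))  # 0.65
print(round(prob_case_contaminated(0.05, 20), 2))  # 0.64
print(smallest_dimension_exceeding(0.05))          # 14
```

With 5% of the cells contaminated, 14 variables are already enough for the expected fraction of affected cases to pass one half.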

To overcome these deficiencies of the classical robust estimators, we introduce a 
three-step regression estimator that can deal with cellwise and casewise outliers. The 
first step of our estimator is aimed at reducing the impact of the outlier propagation posed
by ICM. The second step is aimed at achieving robustness under THCM. As a result,
the robust regression estimate from the third step is shown to be efficient (in terms 
of data usage) and robust under ICM and THCM. We also prove that our estimator 
is consistent and asymptotically normal at the central regression model distribution. 
Finally, we extend our estimator to models with continuous and dummy covariates and 
provide an algorithm to compute the regression coefficients. 

The proposed procedures are implemented in the R package robreg3S, which is freely
available on CRAN (the Comprehensive R Archive Network, R Core Team, 2015).
Abstract

This supplementary material contains all the proofs, additional simulation
results, and related supplementary material referenced in the article “Robust
regression estimation and inference in the presence of cellwise and casewise
contamination”.


1 Proofs of Lemmas and Theorems 


1.1 Proof of Theorem 2.1 


We need the following lemma in the proof.

Lemma 1.1. Let $X, X_1, \dots, X_n$ be independent with a continuous distribution function
$G(x)$. Given $0 < \alpha < 1$, let $\eta = G^{-1}(1 - \alpha)$ and $s = \mathrm{med}(X - \eta \mid X > \eta)$. Now, consider
the following estimators: $\hat\eta_n = G_n^{-1}(1 - \alpha)$ and $\hat s_n = \mathrm{med}(\{X_i - \hat\eta_n \mid X_i > \hat\eta_n\})$. Then,
$\hat s_n \to s$ a.s.

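Proof. Without loss of generality, assume that $X_1 < X_2 < \dots < X_n$. So, $\hat\eta_n =
G_n^{-1}(1 - \alpha) = X_{\lceil n(1-\alpha) \rceil}$, and

$$\#\{X_i \mid X_i > \hat\eta_n\} = n - \lceil n(1 - \alpha) \rceil = n - (n + \lceil -n\alpha \rceil) = \lfloor n\alpha \rfloor.$$

Then, $X_k = \mathrm{med}(\{X_i \mid X_i > \hat\eta_n\})$ where

$$
k = \lceil n(1 - \alpha) \rceil + \left\lceil \frac{\lfloor n\alpha \rfloor}{2} \right\rceil
  = n - \lfloor n\alpha \rfloor + \left\lceil \frac{\lfloor n\alpha \rfloor}{2} \right\rceil
  = n - \left\lfloor \frac{\lfloor n\alpha \rfloor}{2} \right\rfloor
  = n - \left\lfloor \frac{n\alpha}{2} \right\rfloor
  = \left\lceil n \left( 1 - \frac{\alpha}{2} \right) \right\rceil.
$$

In other words, $\mathrm{med}(\{X_i \mid X_i > \hat\eta_n\}) = X_k = G_n^{-1}(1 - \alpha/2)$, and

$$\hat s_n = G_n^{-1}(1 - \alpha/2) - \hat\eta_n.$$

Therefore, $\hat s_n \to s$ a.s., where $s = G^{-1}(1 - \alpha/2) - \eta$.

□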
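
The convergence stated in Lemma 1.1 can be illustrated by simulation. The sketch below assumes a standard exponential distribution for $G$, chosen only because its quantiles are available in closed form: $\eta = -\log \alpha$ and $s = G^{-1}(1 - \alpha/2) - \eta = \log 2$. The function name is hypothetical, and `statistics.median` averages the two middle values for an even count, which differs from the single order statistic used in the proof by a vanishing amount.

```python
import math
import random
import statistics


def tail_location_and_scale(xs, alpha):
    """Empirical (1 - alpha)-quantile and the median excess over it."""
    xs = sorted(xs)
    n = len(xs)
    # Order statistic with index ceil(n * (1 - alpha)), shifted to 0-based indexing.
    eta_hat = xs[math.ceil(n * (1 - alpha)) - 1]
    excess = [x - eta_hat for x in xs if x > eta_hat]
    return eta_hat, statistics.median(excess)


alpha = 0.05
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(100_000)]
eta_hat, s_hat = tail_location_and_scale(sample, alpha)

# Population values for the standard exponential distribution.
eta = -math.log(alpha)
s = math.log(2.0)
print(abs(eta_hat - eta), abs(s_hat - s))
```

Both printed differences shrink as the sample size grows, in line with the almost sure convergence of $\hat\eta_n$ and $\hat s_n$.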


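Proof of Theorem 2.1. Without loss of generality, we consider only the upper tail.
Also, to simplify the notation, we drop the $G$ in the probability and the $u$ that was
used to distinguish between the notations for the upper tail and the lower tail.

Define $F(t)$ and $F_n(t)$ by

$$
F(t) = \frac{P(0 < (X - \eta)/s < t)}{P(X > \eta)}, \qquad
F_n(t) = \frac{\frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t)}{\frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n)}.
$$

Let $F_0(t) = 1 - e^{-t}$. It is sufficient to prove that for every $\epsilon > 0$ there exists $N$ such
that for all $n > N$,

$$\sup_t |F_0(t) - F_n(t)| < \epsilon.$$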


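Note that

$$
\begin{aligned}
|F_0(t) - F_n(t)| \le{}& \left| F_0(t) - \frac{P(0 < (X - \eta)/s < t)}{P(X > \eta)} \right| \\
&+ \left| \frac{P(0 < (X - \eta)/s < t)}{P(X > \eta)} - \frac{P(0 < (X - \hat\eta_n)/\hat s_n < t)}{P(X > \eta)} \right| \\
&+ \left| \frac{P(0 < (X - \hat\eta_n)/\hat s_n < t)}{P(X > \eta)} - \frac{P(0 < (X - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} \right| \\
&+ \left| \frac{P(0 < (X - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} - \frac{\frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} \right| \\
&+ \left| \frac{\frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} - \frac{\frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t)}{\frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n)} \right| \\
={}& A + B + C + D + E.
\end{aligned}
$$

By Assumption 2.1, $A = 0$.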

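Note that

$$
\begin{aligned}
B &= \frac{1}{\alpha} \left| P(0 < (X - \eta)/s < t) - P(0 < (X - \hat\eta_n)/\hat s_n < t) \right| \\
  &= \frac{1}{\alpha} \left| [G(st + \eta) - G(\hat s_n t + \hat\eta_n)] - [G(\eta) - G(\hat\eta_n)] \right|.
\end{aligned}
$$

Next, we show that $\sup_t |G(st + \eta) - G(\hat s_n t + \hat\eta_n)| < \epsilon\alpha/4$ and $|G(\eta) - G(\hat\eta_n)| < \epsilon\alpha/4$.

Take a small $\delta_0 > 0$ such that $s - \delta_0 > c$ and $\eta - \delta_0 > c$ for some $c > 0$. Choose a
large $K > 0$ such that for $K_{\delta_0} = (s - \delta_0)K + \eta - \delta_0$, $G(K_{\delta_0}) > 1 - \epsilon\alpha/4$. First, consider
$t > K$. Since $\delta_0 > 0$, we have $st + \eta > (s - \delta_0)K + (\eta - \delta_0) = K_{\delta_0}$, and therefore,
$G(st + \eta) > G(K_{\delta_0})$. Also, by Lemma 1.1, $\hat s_n \to s$ a.s. and $\hat\eta_n \to \eta$ a.s. So, there
exists $N_0$ such that $|\hat s_n - s| < \delta_0$ and $|\hat\eta_n - \eta| < \delta_0$ for all $n > N_0$. So, we have
$\hat s_n > s - \delta_0$ and $\hat\eta_n > \eta - \delta_0$, which implies $\hat s_n t + \hat\eta_n > (s - \delta_0)K + (\eta - \delta_0) = K_{\delta_0}$ and
$G(\hat s_n t + \hat\eta_n) > G(K_{\delta_0})$. Therefore,

$$\sup_{t > K} |G(st + \eta) - G(\hat s_n t + \hat\eta_n)| < \frac{\epsilon\alpha}{4}.$$

Now, consider $t \le K$. We have $|(st + \eta) - (\hat s_n t + \hat\eta_n)| \le t|s - \hat s_n| + |\eta - \hat\eta_n| < K\delta_0 + \delta_0$.
Now by the uniform continuity of $G$, given $\epsilon > 0$, there exist $\delta > 0$ and $N_1$ such that for $n > N_1$,
$|(st + \eta) - (\hat s_n t + \hat\eta_n)| < \delta$ and therefore, $|G(st + \eta) - G(\hat s_n t + \hat\eta_n)| < \epsilon\alpha/4$. Similarly, there
exists $N_2$ such that for $n > N_2$, $|\eta - \hat\eta_n| < \delta$, and therefore, $|G(\eta) - G(\hat\eta_n)| < \epsilon\alpha/4$.

So, with probability one, take $N = \max\{N_0, N_1, N_2\}$ such that for $n > N$, it implies
that $|(st + \eta) - (\hat s_n t + \hat\eta_n)| < \delta$ and $|\eta - \hat\eta_n| < \delta$. Then, we have

$$
B \le \frac{1}{\alpha} \left\{ \sup_t |G(st + \eta) - G(\hat s_n t + \hat\eta_n)| + |G(\eta) - G(\hat\eta_n)| \right\}
  < \frac{1}{\alpha} \left( \frac{\epsilon\alpha}{4} + \frac{\epsilon\alpha}{4} \right) = \frac{\epsilon}{2}.
$$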


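Next, we have

$$
\begin{aligned}
C &= [G(\hat s_n t + \hat\eta_n) - G(\hat\eta_n)] \left| \frac{1}{1 - G(\eta)} - \frac{1}{1 - G(\hat\eta_n)} \right| \\
  &= [G(\hat s_n t + \hat\eta_n) - G(\hat\eta_n)] \, \frac{|G(\eta) - G(\hat\eta_n)|}{(1 - G(\eta))(1 - G(\hat\eta_n))} \\
  &\le \frac{|G(\eta) - G(\hat\eta_n)|}{1 - G(\eta)} < \frac{1}{\alpha} \cdot \frac{\epsilon\alpha}{4} = \frac{\epsilon}{4}.
\end{aligned}
$$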


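By the Glivenko-Cantelli Theorem, with probability one, there
exists $N_3$ such that for $n > N_3$,
$\sup_t \left| P(0 < (X - \hat\eta_n)/\hat s_n < t) - \frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t) \right| < \epsilon\alpha/16$.
Note that for large enough $n$, we have $P(X > \hat\eta_n) > \alpha/2$. So,

$$
D = \left| \frac{P(0 < (X - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} - \frac{\frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t)}{P(X > \hat\eta_n)} \right|
  < \frac{2}{\alpha} \cdot \frac{\epsilon\alpha}{16} = \frac{\epsilon}{8}.
$$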


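Next, by the Glivenko-Cantelli Theorem again, there exists $N_4$ such that for $n > N_4$,
$\left| P(X > \hat\eta_n) - \frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n) \right| \le \sup_t \left| P(X > t) - \frac{1}{n} \sum_{i=1}^n I(X_i > t) \right| < \epsilon\alpha/16$. Then,
we have

$$
\begin{aligned}
E &\le \left( \frac{1}{n} \sum_{i=1}^n I(0 < (X_i - \hat\eta_n)/\hat s_n < t) \right)
      \frac{\left| P(X > \hat\eta_n) - \frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n) \right|}{P(X > \hat\eta_n) \left( \frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n) \right)} \\
  &\le \frac{\left| P(X > \hat\eta_n) - \frac{1}{n} \sum_{i=1}^n I(X_i > \hat\eta_n) \right|}{P(X > \hat\eta_n)}
   < \frac{2}{\alpha} \cdot \frac{\epsilon\alpha}{16} = \frac{\epsilon}{8}.
\end{aligned}
$$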


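Finally, taking $N = \max\{N_0, N_1, N_2, N_3, N_4\}$, we have

$$\sup_t |F_0(t) - F_n(t)| \le A + B + C + D + E < \frac{\epsilon}{2} + \frac{\epsilon}{4} + \frac{\epsilon}{8} + \frac{\epsilon}{8} = \epsilon.$$

□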


1.2 Proof of Theorem 4.1 

Let $\mathbb{U} = (U_1, \dots, U_n)^t$ be the matrix of zeros and ones with zero corresponding to a filtered
component in $\mathbb{X} = (X_1, \dots, X_n)^t$ and $n_0$ be the number of complete observations after the
filter step. Now, let $C_j = \{i, 1 \le i \le n : U_{ij} = 0\}$ and $C = \bigcup_{j=1}^p C_j$. So, $C_j$ is the
set of indices of filtered values for variable $j$, and $C$ is the set of indices of incomplete
observations. By Boole's inequality,

$$n - n_0 = \#C \le \sum_{j=1}^p \#C_j.$$

Let $\epsilon > 0$ be as described in Section 3. Now, for each variable $\{X_{1j}, \dots, X_{nj}\}$, $j =
1, \dots, p$, apply Theorem 2.1 to obtain $N_j$ such that, with probability one,

$$\#C_j < n\epsilon/p, \quad \text{for } n > N_j.$$

Set $N = \max\{N_1, \dots, N_p\}$. Hence, with probability one,

$$n - n_0 \le \sum_{j=1}^p \#C_j < \sum_{j=1}^p n\epsilon/p = n\epsilon$$

for $n > N$, or equivalently,

$$\frac{n_0}{n} > 1 - \epsilon.$$

Therefore, $U_i = (1, \dots, 1)^t$ according to (6), and $\mathbb{U} = \mathbb{I}$, where $\mathbb{I}$ has every entry equal
to 1. In other words, for $n > N$, the GSE in Section 3 becomes

$$\hat m = m_{GS}(\mathbb{X}, \mathbb{I}), \qquad \hat S = S_{GS}(\mathbb{X}, \mathbb{I}).$$

Since GSE on complete data reduces to the regular S-estimator (Danilov et al., 2012),
this implies that 3S-regression reduces to S-regression for $n > N$.
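
The counting step in this proof (Boole's inequality applied to the filtered cells) can be checked on any 0/1 indicator matrix. The sketch below uses a small made-up matrix; the function name is hypothetical.

```python
def incomplete_rows_and_bound(U):
    """Return (n - n0, sum_j #C_j) for a 0/1 matrix U given as a list of rows.

    A zero marks a filtered cell. n - n0 counts the rows with at least one
    zero, and #C_j counts the zeros in column j.
    """
    p = len(U[0])
    n_incomplete = sum(1 for row in U if 0 in row)
    zeros_per_column = [sum(1 for row in U if row[j] == 0) for j in range(p)]
    return n_incomplete, sum(zeros_per_column)


# Usage: the second row has two filtered cells, so the bound is not attained.
U = [
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]
n_incomplete, bound = incomplete_rows_and_bound(U)
print(n_incomplete, bound)  # 2 3
```

The bound holds with equality exactly when no row contains more than one filtered cell.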





2 Additional figures from the simulation study in Section 
Figure 1: MSE for various cellwise and casewise contamination values, k, for models
with p = 15 continuous covariates. The sample size is n = 150. For details see Section
5.1 in the paper.

Figure 2: MSE for various cellwise and casewise contamination values, k, for models
with $p_x = 12$ continuous and $p_d = 3$ dummy covariates. The sample size is n = 150.
For details see Section 5.2 in the paper.


3 Investigation on the performance on non-normal covariates

Here, we conduct a modest simulation study to compare the performance of 3S-regression, 
the shooting S-estimator, 2S-regression and the LS estimator for data with non-normal 
covariates. 

We consider the same regression model with $p = 15$ and $n = 300$ as in Section 5,
but the covariates are generated from a non-normal distribution as follows. The random
covariates $X_i$, $i = 1, \dots, n$, are first generated from a multivariate normal distribution
$N_p(0, \Sigma)$, where $\Sigma$ is the randomly generated correlation matrix with a fixed condition
number of 100. Then, we transform the variables by doing the following:

$$(X_{i1}, X_{i2}, \dots, X_{ip}) \to (G_1^{-1}(\Phi(X_{i1})), G_2^{-1}(\Phi(X_{i2})), \dots, G_p^{-1}(\Phi(X_{ip}))),$$

where $\Phi(x)$ is the standard normal distribution function. We set $G_j$ as $N(0, 1)$ for $j = 1, 2, 3$,
$\chi^2(20)$ for $j = 4, 5, 6$, $F(90, 10)$ for $j = 7, 8, 9$, $\chi^2(1)$ for $j = 10, 11, 12$, and
Pareto(1, 3) for $j = 13, 14, 15$.
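
The marginal transformation $x \mapsto G_j^{-1}(\Phi(x))$ can be sketched with the standard library alone. The example below handles only the Pareto margin and assumes that Pareto(1, 3) denotes scale 1 and shape 3, a parameterization the text does not spell out; the function names are illustrative.

```python
import random
from statistics import NormalDist


def pareto_quantile(u, scale=1.0, shape=3.0):
    """Quantile function of a Pareto distribution (assumed scale/shape form).

    Requires 0 <= u < 1; u = 1 would correspond to an infinite quantile.
    """
    return scale * (1.0 - u) ** (-1.0 / shape)


def to_marginal(x, quantile):
    """Map a standard normal value x to the margin with the given quantile function."""
    return quantile(NormalDist().cdf(x))


# Usage: transform standard normal draws into Pareto-distributed values.
rng = random.Random(0)
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pareto_draws = [to_marginal(x, pareto_quantile) for x in normal_draws]
```

Because the map is increasing in each coordinate, the rank dependence induced by the correlation matrix carries over to the transformed covariates, while each margin takes the prescribed non-normal shape.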

In the simulation study, we consider the following scenarios: 

• Clean data: No further changes are done to the data; 

• Cellwise contamination: Randomly replace an $\epsilon = 0.05$ fraction of the cells in the
covariates by outliers $X_{ij}^{\text{cont}} = k \times G_j^{-1}(0.999)$ and an $\epsilon$ proportion of the responses
by outliers $Y_i^{\text{cont}} = E(Y_i) + k \times SD(\varepsilon_i)$. We present the results for $k = 1, 5, 10$,
but for larger values of $k$ we obtain similar results.

The number of replicates for each setting is $N = 1000$.

The performance of the estimators in terms of MSE is summarized in Table 1. The
performance of 3S-regression is comparable to that of LS and 2S-regression for clean
data, and 3S-regression outperforms the shooting S, LS and 2S-regression for
cellwise-contaminated data, even under some deviations from the assumptions on the tail
distributions of the covariates.

Table 1: MSE for clean data and cell-wise contaminated data.

                              Cellwise
Estimators    Clean     k = 2    k = 5    k = 10
3S            0.007     0.014    0.013    0.015
ShootS        0.254     0.839    1.048    0.882
2S            0.003     4.102    3.851    4.057
LS            0.001     4.311    6.438    6.588


4 Further analysis of the Boston housing data 


Table 2: Description of the variables in the Boston Housing data

Variables          Description
medv (response)    corrected median value of owner-occupied homes in USD 1000's
crim               per capita crime rate by town
nox                nitric oxides concentration (parts per 10 million)
rm                 average number of rooms per dwelling
age                proportion of owner-occupied units built prior to 1940
dis                weighted distances to five Boston employment centers
tax                full-value property-tax rate per USD 1,000,000
ptratio            pupil-teacher ratio by town
black              (B - 0.63)^2 where B is the proportion of blacks by town
lstat              percentage of lower status of the population


We now further illustrate how the propagation of cellwise outliers in the Boston 
housing data leads to the observed differences among the three robust estimators. 

Recall that half of the cases fully downweighted by 2S-regression have entries flagged 
as cellwise outliers. We replace these flagged cells by their best linear predictions (using 





Table 3: Estimates and p-values of the regression coefficients for the imputed Boston
Housing data.

                   3S               ShootS             2S                LS
Variable     Coeff.  P-Val.    Coeff.  P-Val.    Coeff.  P-Val.    Coeff.  P-Val.
log(lstat)   -0.243   0.000    -0.264            -0.227  <0.001    -0.385  <0.001
rm^2          0.015   0.000     0.013             0.014  <0.001     0.009  <0.001
tax          -0.051   0.000    -0.030            -0.047  <0.001    -0.032   0.002
log(dis)     -0.125   0.000    -0.161            -0.129  <0.001    -0.144  <0.001
ptratio      -0.026   0.000    -0.028            -0.025  <0.001    -0.027  <0.001
nox^2        -0.578   0.013    -0.522            -0.619   0.010    -0.479  <0.001
age          -0.023   0.645    -0.037            -0.037   0.471     0.051   0.386
black        -0.726   0.398     0.371            -0.882   0.376    -0.206   0.519
log(crim)    -0.006   0.513    -0.001            -0.012   0.233    -0.012   0.213


Table 4: Pairwise squared norm distances between the estimates for the imputed Boston
housing data.

             3S    ShootS       2S       LS
3S            -     0.862    0.172    5.486
ShootS                  -    1.158    3.992
2S                               -    6.366
LS                                        -


the 3S-regression estimate) and then refit the model with the four considered estimators.
The resulting coefficient estimates and their P-values are given in Table 3. Notice that
the covariate age is no longer significant under 2S-regression. Moreover, Table 4 shows
the squared norm distances between all the estimates calculated from the imputed data. Now,
2S-regression is considerably closer to the cellwise robust estimators, and it no longer
fully down-weights the cases formerly affected by cellwise outliers (the median weight
of these cases is now 0.64, closer to the overall median weight, 0.69). The LS estimator
remains different from the robust estimators, possibly due to the existence of casewise
outliers in the data. MM-regression (Yohai, 1985) behaves similarly to 2S-regression in
this example.